My name is Tomasz Plata-Przechlewski and I live in Poland. I was born on 16 June 1963 (it was a Sunday, the very day Valentina Vladimirovna Tereshkova was launched into space, if you know who she is).
BTW, in Poland "born on a Sunday" means a work-shy (i.e. lazy) person (so now you know your first Polish proverb).
BTW, by pure statistics \(1/7 \approx 14\)% of the population is work-shy :-)
I graduated in economics a long time ago and have taught statistics and information systems (mainly). I am a big fan of open source software (OSS) and I know a few OSS systems, including Linux and LaTeX. And of course R, which I am about to show you in a while.
My hobbies are road cycling and history. I am also an amateur photographer (cf tprzechlewski@flickr).
Statistics (nothing spectacular, just classical EDA)
Statistical software (modern, non-standard or hipster #youcall)
Poland (via statistical examples)
Theory (models) + Tools (programs) + Practice (real data)
Undergraduate courses in social sciences in Poland concentrate on theory, use a spreadsheet as a universal computing tool and an Office-like editor (MS Word/OO Writer) as a universal publishing tool. Students work with artificial (clean) and small data sets and are thus unaware of the problems related to applying theory to practice.
It is claimed that the above scenario is optimal. More advanced tools would be too difficult (and time-consuming) for students to get acquainted with, thus distracting them from the main subject of the course, i.e. statistical methods.
Office software has limits. Spreadsheets are good for number crunching, but are not so good at: data cleaning (Practice), advanced graphics, spatial analysis (T), team work (Practice). Office editors or PowerPoint are great tools but are not suited to quality publishing of statistical results.
In my (humble) opinion it is completely wrong not to use some modern tools even in introductory courses, as these are (often) the only statistics lectures undergraduate students complete.
I will try to demonstrate that using modern tools for statistical analysis is the way to go, and that (some) modern tools are not more difficult than office software (at a higher than basic level).
Conclusion: less theory, more practice and common sense.
Number of students.
Who is a student?
A student is a person attending a 3rd-level school in the 3-stage education system (cf Educational_stage). The answer is still non-obvious as there are many forms of tertiary education. For example:
The UNESCO stated that tertiary education focuses on learning endeavors in specialized fields. It includes academic and higher vocational education.
So according to the above definition a school does not belong to tertiary education if its status is not academic and/or higher vocational. Example: a Dance Academy or a University for Elderly People (aka University of the 3rd Age). Both are popular in Poland.
In many countries there is some certification scheme. For example, in Poland a school must apply for (and get) a certificate to be regarded as a higher education institution (i.e. part of the tertiary level of education).
Heads vs Majors
A student can be enrolled in more than one course (major). So for counting heads it is necessary to remove duplicates; otherwise one would count majors, not persons.
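The difference between counting majors and counting heads can be sketched in a few lines of R. The data frame below is entirely hypothetical (the column names `student_id` and `major` are assumptions, not taken from any real register):

```r
# Hypothetical enrolment records: one row per (student, major) pair.
enrol <- data.frame(
  student_id = c("s1", "s2", "s2", "s3", "s3", "s3"),
  major      = c("economics", "economics", "law", "math", "physics", "law"),
  stringsAsFactors = FALSE
)

n_majors <- nrow(enrol)                       # counts enrolments (majors)
n_heads  <- length(unique(enrol$student_id))  # counts persons (duplicates removed)

c(majors = n_majors, heads = n_heads)
```

With double and triple majors in the data the two counts differ, which is exactly why the definition has to be fixed before any comparison.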
Part time studies
FTE stands for Full-Time-Equivalent, an approximation of the number of students who would be enrolled full-time
Full-time equivalent (FTE): FTE is based on student credit hours. It is obtained by dividing the total student credit hours by the number of credit hours that constitutes full-time study.
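A minimal sketch of the FTE calculation follows. Both the credit hours and the full-time load of 30 credit hours are made-up values used only for illustration; the actual full-time load depends on the institution.

```r
# Hypothetical credit hours taken by six students in one semester.
credit_hours   <- c(30, 30, 15, 15, 10, 20)
full_time_load <- 30   # assumed number of credit hours for full-time study

heads <- length(credit_hours)               # number of persons
fte   <- sum(credit_hours) / full_time_load # full-time equivalents

c(heads = heads, fte = fte)
```

Part-time students lower the FTE relative to the head count, so the two measures answer different questions.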
Conclusion: Majors, Persons or FTEs? Which is the best?
University of Utah/Office of Analysis, Assessment and Accreditation google:single multiple majors fte
Who is a tourist? According to Glossary:Tourism:
Tourism means the activity of visitors taking a trip to a main destination outside their usual environment, for less than a year, for any main purpose, including business, leisure or other personal purpose, other than to be employed by a resident entity in the place visited.
According to the above definition, to be regarded as a tourist one has to change her/his place of accommodation for less than one year (otherwise Eurostat would regard her/him as a migrant).
The usual meaning (at least in Poland) is that a tourist travels for leisure, not for work. People travelling for work have other needs/aims than those travelling to rest (they usually do not use hotels, for example), so the above definition solves some problems but at the same time creates many others.
Number of tourists: does not distinguish between various forms of tourists, difficult to collect (who is a tourist anyway?)
Various ‘number of’ measures for tourist-oriented establishments (hotels, catering units, beds, nights spent, etc.). They do not measure tourists per se but are highly related and more reliable (as easier to count).
Indicators of tourist activity (by various tourist types).
Conclusion: measurement of tourism activity is not trivial. Other similar cases: internet user, migrant, unemployed person, illiterate person.
Tourism supply statistics (accommodation statistics): Data on rented accommodation ie. capacity and occupancy of tourist accommodation establishments in the reporting country. How collected? Registers?
Quirks of data collection: data up to and including 2015 refer only to those units that submitted statistical reports. Starting with the data for January 2016, imputation was introduced, i.e. replacing missing data with some (possibly meaningful :-)) values (cf BDL).
Tourism demand statistics: Data on participation in tourism of the residents of the reporting country. How collected? Surveys?
Most of the time, data on domestic and outbound trips (where “outbound tourism” means residents of a country travelling in another country) is collected via sample surveys (cf Annual data on trips of EU residents and Tourism_statistics_-_top_destinations)
Regulations concerning data collection in tourism (hundreds of pages): Glossary:Supply_side_tourism_statistics and EU regulation No 692/2011
So now we know what we are dealing with…
Share of nights spent at EU-28 tourist accommodation by tourists travelling outside their own country of residence, 2017
Country of residence -> Foreign country (estimated data)
| year | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|---|---|
| # | 10173237 | 9609447 | 10064628 | 10620264 | 11876599 | 12471268 | 12992241 | 13757657 | 15579225 | 16705215 |
## [1] 10173237 9609447 10064628 10620264 11876599 12471268 12992241
## [8] 13757657 15579225 16705215
to be continued…
and international comparison of GDP
** ADD HERE **
(univariate analysis)
The CSV file hotele_caloroczne_PL.csv contains data on the number of all-season hotels in every county in Poland. First one has to load the dataset with the read.csv command:
d <- read.csv("hotele_caloroczne_PL.csv", sep = ';', header = TRUE, na.strings = "NA")
Computing measures of central tendency (with summary and/or fivenum)
summary(d)
## teryt powiat hotele2012 hotele2017
## Min. : 201 bielski : 2 Min. : 0.000 Min. : 0.00
## 1st Qu.:1005 brzeski : 2 1st Qu.: 3.000 1st Qu.: 4.00
## Median :1636 grodziski : 2 Median : 5.000 Median : 7.00
## Mean :1721 krośnieński: 2 Mean : 8.776 Mean : 10.31
## 3rd Qu.:2475 nowodworski: 2 3rd Qu.: 10.000 3rd Qu.: 11.00
## Max. :3263 opolski : 2 Max. :158.000 Max. :183.00
## (Other) :368 NA's :1
fivenum(d$hotele2017)
## [1] 0 4 7 11 183
Computing mean:
mean(d$hotele2017)
## [1] 10.31053
And dispersion:
var(d$hotele2012); var(d$hotele2017)
## [1] NA
## [1] 244.8743
sd(d$hotele2012); sd(d$hotele2017)
## [1] NA
## [1] 15.64846
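The NA results come from a missing value in hotele2012 (see the summary above). Second attempt, this time dropping missing values with na.rm (and with more compact output):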
c(var(d$hotele2012, na.rm=T), var(d$hotele2017, na.rm=T),
sd(d$hotele2012, na.rm=T), sd(d$hotele2017, na.rm=T))
## [1] 183.11094 244.87430 13.53185 15.64846
BTW:
c( mean(d$hotele2012, na.rm=T), mean(d$hotele2017, na.rm=T))
## [1] 8.775726 10.310526
Or more formally: on average there were 8.78 hotels per county in Poland in 2012, while in 2017 there were 10.31.
The interquartile range (IQR) is the difference between the upper (75%) and the lower (25%) quartile. It covers the central 50% of the observations. The IQR is a robust measure of dispersion, i.e. it is little affected by extreme values:
c( IQR(d$hotele2012, na.rm=T), IQR(d$hotele2017, na.rm=T))
## [1] 7 7
Finally, we can equally easily assess the skewness.
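One possible way to assess skewness without any extra package is the moment-based coefficient (third central moment divided by the cube of the standard deviation). The function `skewness` below is a hypothetical helper written for this sketch, and the vector `hotels_demo` is made-up data, not the county data set used above.

```r
# Moment-based skewness: m3 / m2^(3/2), with missing values dropped.
skewness <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  m  <- mean(x)
  m2 <- mean((x - m)^2)   # second central moment
  m3 <- mean((x - m)^3)   # third central moment
  m3 / m2^1.5
}

# Made-up, right-skewed counts (a few large values, many small ones).
hotels_demo <- c(0, 1, 2, 3, 3, 4, 5, 7, 10, 25, 60, NA)
skewness(hotels_demo)
```

A positive value indicates a long right tail, which is what one would expect for counts such as hotels per county, where a few big cities dominate.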
Some variables by definition are positive (or non-negative): income, market share.
Decoration,
One graph is more effective than another if its quantitative information can be decoded more quickly/easily [Robbins 2005]
Recommended: (ordered) dot plots, bar charts, histograms and kernel density estimates, stripcharts, multipanel displays (multiple line/dot plots instead of stacked bars), scatterplots (two variables)
Not recommended: pie charts, bubble charts, stacked bar charts
Bar/line/pie charts were introduced by William Playfair (late 18th/early 19th century). Dot plots were introduced by Cleveland (1980s). Box-plots were introduced by Tukey (1970s).
Never use (pseudo) 3D charts for 2D data. Virtually no one can read them.
Jittering: adding random noise to data to avoid overlapping.
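Jittering can be sketched with the base R function `jitter()`. The data below are made up; the `amount` of 0.2 is an arbitrary choice for illustration.

```r
set.seed(42)                      # make the random noise reproducible
x  <- rep(1:3, each = 5)          # heavily overlapping values (hypothetical)
xj <- jitter(x, amount = 0.2)     # add uniform noise from [-0.2, 0.2]

# In a strip chart the same idea is available via method = "jitter":
# stripchart(x, method = "jitter")
range(xj - x)
```

The noise only spreads the points visually; any statistics should still be computed from the original values.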
Before we continue with statistical graphics, a short 2-slide diversion on geocode standards used in statistics.
No doubt in every reliable survey the population has to be precisely defined, i.e. 3 dimensions of every surveyed unit should be fixed: definition (what), time (when measured), space (where)…
I always repeat to my students: if you look at some data (in the media for example), start by establishing whether you know what, when and where. If no information (or a reliable link, called a source, to that information) is provided on any of these dimensions, treat the data as rubbish and do not waste time using/analysing it.
Further dissemination of such defective data should be publicly prosecuted (joke).
I have already tried to show you that the what dimension is complicated and often highly unreliable/arbitrary (due to the nature of the phenomenon and/or measurement difficulties).
The when dimension is much simpler thanks to a universal standard, i.e. time. You gather data either for a certain moment (how many hotels were in use on 31 December 2018) or for a certain period of time (how many beds were sold in these hotels in the 3rd quarter of 2018).
The where dimension in turn is usually based on administrative or statistical (geographical) units (country, state/province, county, community). But contrary to the time dimension there is no universal or globally accepted standard for geostatistical units. Usually such a standard is based on the administrative system, which is country-dependent.
The administrative division of Poland since 1999 has been based on three levels of subdivision (cf Administrative divisions of Poland). Since Poland became a member of the European Union in 2004, EU regulations have been part of the national legal system.
EU regulates everything, statistics included.
Conclusion: The pigs had to expend enormous labours every day upon mysterious things called “files,” “reports,” “minutes,” and “memoranda.” These were large sheets of paper which had to be closely covered with writing, and as soon as they were so covered, they were burnt in the furnace (George Orwell, Animal Farm)
The Nomenclature of Territorial Units for Statistics (NUTS) is a geocode standard for referencing the subdivisions of countries for statistical purposes. The standard is developed and regulated by the European Union, and thus only covers the member states of the EU in detail (cf NUTS)
The NUTS standard has been revised several times (on average every 4 years :-)), so there is even a page in the ec.europa.eu domain dedicated to the (short) history of NUTS (cf NUTS history).
NUTS1 (level): macroregion, NUTS2: state, NUTS3: county.
We would like to plot a chart showing the number of hotels.
Poland is divided into 16 states (NUTS2) and 380 counties (NUTS3), which are equal to administrative units. So on average there are 23.75 counties per state. The NUTS1 level is only for statistical purposes (but the regions are in fact distinct due to history, economics, natural conditions, cultural factors etc.)
There is a relevant and interesting page by GUS (Main Statistical Office or Główny Urząd Statystyczny), but unfortunately in Polish (use Google Translate :-) or mail me in case you are interested) (cf Klasyfikacja NUTS w Polsce).
The above map shows 7 macroregions (NUTS1) and 16 provinces (NUTS2). BTW province in Polish is “prowincja” (both words come from Latin), but the Polish administrative province is actually called “województwo”, from “wodzić”, i.e. commanding (armed troops in this context). This is an old term/custom from the 14th century, when Poland was divided into provinces (every province ruled by a “wojewoda”, i.e. the chief of that province). More can be found at Wikipedia (cf Administrative divisions of Poland).
NUTS3 consists of 380 counties (called “powiat”). In ancient Poland a powiat was called “starostwo” and the head of a “starostwo” was called “starosta”. “Stary” means old, so “starosta” is an old (and thus wise) person. BTW the head of a powiat is still called “starosta”, as 600 years ago :-)
There is no NUTS4 level, but there is a 3rd level of Polish administration used by GUS (Main Statistical Office). This 3rd level is called “gmina” (community).
There are (approximately) 2750 communities in Poland. As Poland's population is 38.5 million and its area equals 312.7 thousand sq km (about 123 persons per sq km), on average each powiat has about 820 sq km and each community about 114 sq km, or approximately 100 thousand persons per “powiat” and 14 thousand per “gmina”.
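The back-of-the-envelope averages above can be checked directly in R. The inputs are the rounded figures quoted in the text (38.5 million people, 312.7 thousand sq km, 380 counties, roughly 2750 communities), so the results are rough approximations too.

```r
pop      <- 38.5e6    # population of Poland (rounded)
area     <- 312.7e3   # area in sq km (rounded)
n_powiat <- 380       # number of counties
n_gmina  <- 2750      # approximate number of communities

averages <- c(
  persons_per_sqkm   = pop / area,
  sqkm_per_powiat    = area / n_powiat,
  sqkm_per_gmina     = area / n_gmina,
  persons_per_powiat = pop / n_powiat,
  persons_per_gmina  = pop / n_gmina
)
round(averages, 1)
```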
TERYT is the Polish counterpart of NUTS (developed some 50 years ago). It is a complex system which includes identification of administrative units. Every unit has an id number of up to 7 digits: wwppggt, where ww = “województwo” id, pp = “powiat” id, gg = “gmina” id and “t” encodes the type of community (rural, municipal or mixed). Higher-level units have trailing zeros for the irrelevant part of the id, so 14 and 1400000 mean the same, as do 1205 and 1205000. Six digits are enough to identify a community (approx. 2750 units).
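The wwppggt layout can be illustrated with a small helper. `parse_teryt` is a hypothetical function written for this sketch (it is not part of any package), and it assumes the ids are passed as character strings with leading zeros preserved; numeric ids such as 201 would first have to be formatted as "0201".

```r
# Split TERYT ids (given as strings of 2, 4, 6 or 7 digits) into their parts.
parse_teryt <- function(id) {
  id <- as.character(id)
  stopifnot(all(nchar(id) %in% c(2, 4, 6, 7)))
  # Pad with trailing zeros to the full 7-digit form wwppggt.
  full <- paste0(id, strrep("0", 7 - nchar(id)))
  data.frame(
    id   = full,
    woj  = substr(full, 1, 2),   # ww: województwo
    pow  = substr(full, 3, 4),   # pp: powiat
    gmi  = substr(full, 5, 6),   # gg: gmina
    type = substr(full, 7, 7),   # t: type of community
    stringsAsFactors = FALSE
  )
}

parse_teryt(c("14", "1205", "0201"))
```

Keeping the ids as strings avoids the silent loss of leading zeros that happens when such codes are read as numbers.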
So you are now experts on administrative division of Poland, and we can go back to statistical charts…
A strip chart (strip plot) shows the distribution of data points along a numerical axis. These plots are a suitable alternative to box plots when sample sizes are small (because they preserve more information about the data).
Histograms show the distribution of a set of data. To draw a histogram the numbers (observations) are grouped into bins (intervals or classes). There is a tradeoff between showing details and showing the overall picture. When the bin width changes, the scale of the Y-axis changes as well (more bins means fewer observations in each bin).
library(ggplot2)
ggplot(d, aes(x = hotele2017)) +
  geom_histogram(bins = nclass.Sturges(d$hotele2017))
Histograms with binwidth equal to 20,10, 5 and 1 respectively:
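The effect of the bin width on the counts can also be seen without drawing anything. The sketch below uses base R `hist()` with `plot = FALSE` on a made-up vector (`hotels_demo` is not the real county data), and `bin_counts` is a hypothetical helper introduced only here.

```r
# Made-up hotel counts with one very large value, mimicking a skewed variable.
hotels_demo <- c(0, 1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10,
                 11, 12, 15, 18, 25, 40, 60, 183)

# Counts per bin for a given bin width, with bins starting at 0.
bin_counts <- function(x, width) {
  breaks <- seq(0, ceiling(max(x) / width) * width, by = width)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

widths <- c(20, 10, 5, 1)
counts <- lapply(widths, function(w) bin_counts(hotels_demo, w))

# Number of bins and the tallest bar for each bin width.
data.frame(width = widths,
           bins  = sapply(counts, length),
           max_count = sapply(counts, max))
```

Narrower bins give more bars but a lower Y-axis maximum, which is the tradeoff between detail and overall picture described above.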
Kernel density functions
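# Kernel density estimate with the default bandwidth
ggplot(data=d) + geom_density(aes(x=hotele2017))
# The same estimate with the default bandwidth multiplied by 0.25, 1, 2 and 8
p1 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=0.25)
p2 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=1.0)
p3 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=2.0)
p4 <- ggplot(data=d) + geom_density(aes(x=hotele2017), adjust=8.0)
library(ggpubr)  # ggarrange() is assumed to come from the ggpubr package
ggarrange(p1,p2,p3,p4)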
Box-plots are much better than histograms for comparing the distributions of more than one data set.
Construction of a (typical) box-plot: the middle bar is the median. The bottom/top edges of the rectangle are the 1st and 3rd quartiles, so the height of the rectangle is the IQR (interquartile range). The fanciful bars below/above the rectangle, called whiskers (google: whiskers mustache :-)), extend up to 1.5 times the IQR from the rectangle (or to the minimum/maximum if those values lie within 1.5 IQR). The symbols below/above the whiskers (usually open circles) are outliers (non-typical/extreme values).
Note the trick: outliers are defined not as (for example) the top/bottom 1% of values (every distribution would have outliers in such a case) but as values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR (so distributions with moderate variability may have no outliers at all).
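The outlier rule can be written down as a short function. `tukey_outliers` is a hypothetical helper for this sketch; it uses quartiles from `quantile()`, whereas base R `boxplot()` uses Tukey's hinges, so the two may differ slightly for small samples.

```r
# Values lying more than k * IQR below the 1st or above the 3rd quartile.
tukey_outliers <- function(x, k = 1.5) {
  x   <- x[!is.na(x)]
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  x[x < q[1] - k * iqr | x > q[2] + k * iqr]
}

tukey_outliers(c(1, 2, 3, 4, 5, 100))  # one extreme value
tukey_outliers(1:10)                   # evenly spread data
```

Because the fences depend on the IQR of the data at hand, an evenly spread sample produces no outliers at all, unlike a fixed "top 1%" rule.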
Example: age of Nobel-prize winners (cf The Nobel Prize API Developer Hub)
d <- read.csv("nobel_laureates3.csv", sep = ';', dec = ",", header = TRUE, na.strings = "NA")
ggplot(d, aes(x=category, y=age, fill=category)) + geom_boxplot() + ylab("years") + xlab("")
## Warning: Removed 39 rows containing non-finite values (stat_boxplot).
Multiple histograms are too detailed (binwidth=5). It is impossible, for example, to establish which category has the youngest laureates (on average), or which has the oldest (economics and literature are candidates, but due to the multimodality of the distribution for literature laureates it is difficult to say for sure…)
ggplot(d, aes(x=age, fill=category)) + geom_histogram(binwidth=5) +
facet_grid(category ~ .)
## Warning: Removed 39 rows containing non-finite values (stat_bin).
A scatter-plot (aka scatter diagram, xyplot) is the basic form used for two (quantitative) variables.
To see the relationship between the variables, a line can be fitted. The least squares (LS) line, which assumes a linear relationship between the variables, is fitted by minimizing the sum of squares of the residuals (a residual is the difference between a data point and the corresponding point on the line, i.e. a point computed from the formula y = a + bx, where x is the value of the x-axis variable).
Alternatively, a loess curve can be used, which does not assume linearity.
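Both fits are available in base R via `lm()` and `loess()`. The sketch below uses simulated data (the intercept 2, slope 3 and noise level are arbitrary assumptions) and compares the residual sum of squares of the LS line with that of another candidate line.

```r
set.seed(1)
x <- 1:50
y <- 2 + 3 * x + rnorm(50, sd = 5)   # simulated, roughly linear data

# Least squares line y = a + b * x
fit_ls <- lm(y ~ x)
coef(fit_ls)

# Sum of squared residuals for the LS line and for the line y = 2 + 3x
rss_ls    <- sum(residuals(fit_ls)^2)
rss_other <- sum((y - (2 + 3 * x))^2)
c(ls = rss_ls, other = rss_other)

# Loess: a local regression curve that does not assume linearity
fit_lo  <- loess(y ~ x)
pred_lo <- predict(fit_lo)
```

By construction no other straight line can have a smaller residual sum of squares than the LS line, not even the line that generated the data.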
A logarithmic scale makes it possible to plot values whose range is too wide for a linear scale. Base 10 logarithms ‘squeeze’ the numbers more than base 2 logarithms (log10(100)=2 while log2(100)=6.64). Moreover, if the original scale contains powers of 10 use log10 to get a ‘nice’ log scale, while if it contains powers of 2 use log2.
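A few lines of R illustrate how the two bases behave. The input values are arbitrary and chosen only to give round results.

```r
powers10 <- c(1, 10, 100, 1000)
powers2  <- c(1, 2, 4, 8)

log10(powers10)   # equally spaced on a log10 scale
log2(powers2)     # equally spaced on a log2 scale

# The same number on both scales: log10 squeezes more than log2.
c(log10 = log10(100), log2 = log2(100))

# Equal ratios (here: times 10) become equal distances after taking logs.
steps <- diff(log10(c(5, 50, 500)))
steps
```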
A logarithmic axis turns an additive scale into a ‘multiplicative’ one: equal distances correspond to equal ratios rather than equal differences. Example (Nobel prize again):
dA <- read.csv("nobel_laureates3.csv", sep = ';', dec = ",", header = TRUE, na.strings = "NA")
nrow(dA)
## [1] 934
dS <- subset(dA, (! bornCountryCode == "" )) ## by country of birth
nrow(dS) ## how many
## [1] 901
aggregate by bornCountryCode
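The aggregation step could look roughly as follows. The original code is not shown here, so this is only a sketch on a tiny made-up data frame (`dS_demo`); the column name `bornCountryCode` is taken from the text, while the result column `n` is an assumption.

```r
# Made-up stand-in for the laureates data: one row per laureate.
dS_demo <- data.frame(
  bornCountryCode = c("US", "US", "PL", "DE", "US", "PL"),
  stringsAsFactors = FALSE
)

# Count laureates per country of birth and sort in decreasing order.
by_country <- as.data.frame(table(dS_demo$bornCountryCode),
                            stringsAsFactors = FALSE)
names(by_country) <- c("bornCountryCode", "n")
by_country <- by_country[order(-by_country$n), ]
by_country
```

The resulting table, with one row per country, is the kind of data that is then plotted on arithmetic and logarithmic scales.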
Finally plot the resulting data using various Y-axis scales (arithmetic, log2 and log10)
Position along common scale
Position along common but nonaligned scales
Length
Angle (slope)
Area
Volume
Color (hue)
Color (saturation)
Color (density of black)
Angle judgement is not precise. Acute angles are underestimated while obtuse angles (greater than 90 degrees) are overestimated.
Area judgement is biased as well. It is impossible to distinguish small differences in area, while it is quite easy when the same data are plotted along a common scale.
The most accurate graphical task is judging position along a common scale.
To visualize n-dimensional data do not use more dimensions than n.
Always include 0 in numerical axes
** ADD **
The ratio between the width and the height of a rectangle is called its aspect ratio.
The aspect ratio describes the area that is occupied by the data in the chart. A change in aspect ratio changes the perception of the graph. The question is which aspect ratio is the best.
We recognize change most easily if the absolute slopes are close to a 45 degree angle on the graph. It is much harder to see change if the curves are nearly horizontal/vertical. The idea behind banking (Cleveland, 1988) is therefore to adjust the aspect ratio of the entire plot in such a way that most slopes are at approximately a 45 degree angle.
Setting the aspect ratio so that the average of the values of the orientations is 45 degrees is called “banking the average orientation to 45 degrees”.
Setting the aspect ratio so that the mean orientation of the line segments, weighted by the segments’ length, is approximately 45 degrees is called the average weighted orientation method (to 45 degrees).
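The (unweighted) average orientation method can be sketched numerically. `bank_to_45` is a hypothetical helper written for this illustration; it assumes that the axes span exactly the range of the data and it returns the height/width ratio of the plotting area (take the inverse for width/height). The weighted variant is not implemented here.

```r
# Find the height/width ratio for which the mean absolute orientation
# of the line segments joining consecutive points equals 45 degrees.
bank_to_45 <- function(x, y) {
  dx <- diff(x)
  dy <- diff(y)
  # Absolute slopes in units of the data ranges (i.e. on a square plot).
  s <- abs(dy / diff(range(y))) / abs(dx / diff(range(x)))
  s <- s[is.finite(s) & s > 0]          # drop flat and vertical segments
  f <- function(alpha) mean(atan(alpha * s)) - pi / 4
  uniroot(f, interval = c(1e-6, 1e6), tol = 1e-10)$root
}

# A zigzag whose segments are steep on a square plot.
bank_to_45(x = 1:3, y = c(0, 1, 0))
```

The resulting ratio could then be passed, for example, to the `asp`-like settings of a plotting system, keeping in mind that those are usually expressed in data units rather than as a physical height/width ratio.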
*** Example ***
Geocoding and reverse geocoding
QGis
The determinants of the tourist traffic in the castle’s museum of Malbork
icos-crop-data exploring-coffee-production-and-consumption ico-coffee-crop-data-data-wrangling
So you are probably still wondering why I am punishing myself by using such an odd system. The most important argument, which I will present momentarily, concerns the basic approach (philosophy, if one wants to sound lofty) to doing statistical analysis.
This mode (or concept) is called Reproducible Research (RR in short).
Serious statistical analysis is not a one-off job. There is a value chain as well as a life cycle of statistical analysis. Value chain means that there are distinct stages, while life cycle means that the same data/models are used for years; most statistical analyses do not start from scratch but are based on data from the past augmented with new data. The problem is that the new data and model modifications should be in sync with the past.
To make the problem worse, serious statistics should also be in sync with the work of others (to ease, or make possible, any meaningful (international) comparisons, for example).
R/Rstudio for computing and data visualization
Github for enhancing team work
markdown for reproducible research
Introduce reproducible research approach
Use real (big and dirty) data sets.
Introduce some programming (programming or using a mouse?)
Introduce some new tools (R/Rstudio, QGIS, Github)